STOCHASTIC  APPROXIMATIONS  FOR
FINITE-STATE  MARKOV  CHAINS

by 

D.-J. Ma, A. M. Makowski and A. Shwartz
University  of  Maryland  and  Technion 

ABSTRACT 

In  constrained  Markov  decision  problems,  optimal  policies  are  often  found  to  depend  on 
quantities  which  are  not  readily  available  due  either  to  insufficient  knowledge  of  the  model 
parameters or to computational difficulties. This motivates the on-line estimation (or computation)
problem investigated in this paper in the context of a single parameter family of finite-
state  Markov  chains.  The  computation  is  implemented  through  an  algorithm  of  the  Stochastic 
Approximations  type  which  recursively  generates  on-line  estimates  for  the  unknown  value.  A 
useful  methodology  is  outlined  for  investigating  the  strong  consistency  of  the  algorithm  and 
the  proof  is  carried  out  under  a  set  of  simplifying  assumptions  in  order  to  illustrate  the  key 
ideas  unencumbered  with  technical  details.  An  application  to  constrained  Markov  decision 
processes  is  briefly  discussed. 


Keywords:  Markov  chains,  Recursive  Estimation,  Stochastic  Approximation,  Regularity, 
Implementation,  Adaptive  Control. 
1.  INTRODUCTION


It is well known that many questions concerning Markov decision processes (MDPs) can
be reduced to a search for Markov stationary policies which satisfy certain constraint (or
optimality) conditions. However, the authors argued in [5] that the resulting Markov stationary
policies are usually not readily implementable, sometimes in spite of strong structural
properties. This is so because the values of the model parameters may not be available [4,9], and
even if they were available, the policy may still not be implementable due to computational
difficulties inherent to its definition [9].

In this paper, the discussion is given in the context of finite-state MDPs. It is assumed
that the policy $g$ of interest belongs to a one-parameter family of Markov stationary policies
$\{f^\eta, 0 \le \eta \le 1\}$ and that the parameter value $\eta^*$ characterizing $g$ is specified by $J(f^{\eta^*}) = V$
for some given scalar $V$, where $J(\gamma)$ is the cost incurred by using an admissible policy $\gamma$. The
problem of interest is the on-line estimation (or computation) of the parameter $\eta^*$, and is solved
here through an adaptive algorithm of the Stochastic Approximation type. The adaptive policy
$\alpha$ defined through this estimation algorithm is shown to incur the same cost as the policy $g$, i.e.,
$J(\alpha) = J(g)$, thus simultaneously resolving the above-mentioned implementation difficulties.

This problem is motivated in Section 3 via an example from the theory of constrained
MDP’s,  which  provides  the  intuition  behind  the  proposed  adaptive  algorithm.  The  convergence 
results  for  the  estimates  and  for  the  cost  under  the  adaptive  policy  a  are  presented  in  Section 
4.  The  method  of  proof  uses  the  ODE  method  as  discussed  by  Metivier  and  Priouret  [7],  but 
the  specific  structure  of  the  model  at  hand  allows  for  great  simplifications  in  their  arguments. 
The  required  regularity  properties  are  derived  in  Section  5  under  minimal  conditions  on  the 
transition  probabilities,  and  the  main  estimate  that  underlies  the  use  of  the  ODE  method  is 
developed  in  Section  6.  Section  7  concludes  with  an  application  to  constrained  MDP’s  and  an 
extension  of  the  results  to  models  with  weaker  regularity  properties. 

A few words on the notation used throughout the paper: The set of all real numbers is
denoted by $\mathbb{R}$, and $1(A)$ stands for the indicator function of a set $A$. Unless stated otherwise,
$\lim_n$, $\overline{\lim}_n$ and $\underline{\lim}_n$ are taken with $n$ going to infinity.
2.  MODEL  AND  ASSUMPTIONS


Assume the state space to be a finite set $S$ of cardinality $d$ and let the control space $U$
be an arbitrary measurable space. The one-step transition mechanism $P$ is defined through
the one-step transition probability functions $p_{xy}(\cdot) : U \to \mathbb{R}$ which are assumed to be Borel
measurable and to satisfy the standard properties $0 \le p_{xy}(u) \le 1$ and $\sum_y p_{xy}(u) = 1$ for $x$
and $y$ in $S$, and all $u$ in $U$. The space of probability measures on $U$ (when equipped with its
natural Borel $\sigma$-field) is denoted by $\mathbb{M}$.

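The sample space $\Omega := S \times (U \times S)^\infty$ is the canonical space for the MDP $(S, U, P)$. With
$\omega = (x_0, u_0, x_1, u_1, \ldots)$ denoting a generic element of $\Omega$, the coordinate mappings $\{U_n\}_0^\infty$
and $\{X_n\}_0^\infty$ are defined by setting $U_n(\omega) := u_n$ and $X_n(\omega) := x_n$ for all $n = 0,1,\ldots$ The
sample space $\Omega$ is equipped with the $\sigma$-field $\mathcal{F} := \vee_{n=0}^\infty \mathcal{F}_n$ where
$\mathcal{F}_n := \sigma\{X_0, U_0, X_1, \ldots, U_{n-1}, X_n\}$ for all $n = 0,1,\ldots$, so that the mappings $\{U_n\}_0^\infty$ and
$\{X_n\}_0^\infty$ are all random variables (RV).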

An admissible control policy $\gamma$ is defined as any collection $\{\gamma_n\}_0^\infty$ of conditional distributions
on $U$, i.e., for all $n = 0,1,\ldots$, the RV $\omega \to \gamma_n(A,\omega)$ is $\mathcal{F}_n$-measurable for every Borel
subset $A$ of $U$, with the interpretation that $\gamma_n(\cdot,\omega)$ is the probability distribution for selecting
the control value $U_n$ given the feedback information $(X_0(\omega), U_0(\omega), X_1(\omega), \ldots, U_{n-1}(\omega), X_n(\omega))$.
Denote the collection of all such admissible policies by $\Pi$.

Let $\mu$ be a fixed probability distribution on $S$. For every admissible policy $\gamma$ in $\Pi$, the
Kolmogorov Extension Theorem then guarantees the existence (and uniqueness) of a probability
measure $P^\gamma$ on the $\sigma$-field $\mathcal{F}$ so that under $P^\gamma$, the RV $X_0$ has distribution $\mu$ and

$$P^\gamma[X_{n+1} = y \mid \mathcal{F}_n] = \int_U \gamma_n(du)\, p_{X_n y}(u), \qquad n = 0,1,\ldots \tag{2.1}$$

for all $y$ in $S$. The expectation operator associated with $\gamma$ is denoted by $E^\gamma$.

A policy $\gamma$ in $\Pi$ is said to be a Markov or memoryless policy if there exists a family $\{g_n\}_0^\infty$
of mappings $g_n : S \to \mathbb{M}$ such that $\gamma_n(\cdot) = g_n(\cdot, X_n)$ $P^\gamma$-a.s. for all $n = 0,1,\ldots$ In the event
the mappings $\{g_n\}_0^\infty$ are all identical to a given mapping $g : S \to \mathbb{M}$, the Markov policy is
termed stationary and will be identified with the mapping $g$ itself. For any Markov stationary
$g$, define the $d \times d$ matrix $P(g) = (p_{xy}(g))$ by posing

$$p_{xy}(g) := \int_U g(du, x)\, p_{xy}(u)$$

for all $x$ and $y$ in $S$.


3.  THE  IMPLEMENTATION  PROBLEM  —  AN  EXAMPLE 

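For any mapping $c : S \to \mathbb{R}$, define the corresponding long-run average cost functional
$J_c : \Pi \to \mathbb{R}$ by posing

$$J_c(\gamma) := \lim_n \frac{1}{n+1} E^\gamma \Big[ \sum_{i=0}^n c(X_i) \Big] \tag{3.1}$$

for every admissible policy $\gamma$ in $\Pi$.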

The problem of interest here is to find a Markov stationary policy $g$ such that $J_c(g) = V$,
with $V$ some real constant determined through various design considerations. Consider the
situation where there exist two implementable Markov stationary policies $\underline{g}$ and $\overline{g}$ such that

$$J_c(\underline{g}) < V < J_c(\overline{g}), \tag{3.2}$$

i.e., the Markov stationary policy $\underline{g}$ (resp. $\overline{g}$) undershoots (resp. overshoots) the requisite
performance level $V$. This situation arises naturally in the solution of constrained MDPs via
Lagrange arguments, and is discussed in Section 7. For every $\eta$ in the unit interval $[0,1]$, the
policy $f^\eta$ obtained by simply randomizing between the two policies $\overline{g}$ and $\underline{g}$ with bias $\eta$ is the
Markov stationary policy determined through the mapping $f^\eta : S \to \mathbb{M}$ where

$$f^\eta(\cdot, x) := \eta\, \overline{g}(\cdot, x) + (1 - \eta)\, \underline{g}(\cdot, x) \tag{3.3}$$

for all $x$ in $S$. Note that for $\eta = 1$ (resp. $\eta = 0$), the randomized policy $f^\eta$ coincides
with $\overline{g}$ (resp. $\underline{g}$). Owing to the condition (3.2), if the mapping $\eta \to J_c(f^\eta)$ is continuous
on the interval $[0,1]$, then at least one randomized strategy $f^{\eta^*}$ meets the value $V$ and its
corresponding bias value $\eta^*$ is a solution of the equation

$$J_c(f^\eta) = V, \qquad \eta \text{ in } [0,1], \tag{3.4}$$

whence $g = f^{\eta^*}$ steers (3.1) to the value $V$.
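
As a point of comparison for what follows, here is a minimal sketch of solving (3.4) directly when the model is fully known. Everything in it is an assumption made for illustration: the two-state chain, the transition matrices `P_LOW` and `P_HIGH`, the cost vector and the target value are hypothetical and do not come from the paper, and the bisection presumes that the average cost is increasing in the bias. For this toy chain the average cost is available in closed form, which is precisely what is usually not the case.

```python
# Hypothetical two-state example (states 0 and 1); none of these numbers come from the paper.
P_LOW = [[0.9, 0.1], [0.5, 0.5]]   # transition matrix under the undershooting policy
P_HIGH = [[0.5, 0.5], [0.1, 0.9]]  # transition matrix under the overshooting policy
COST = [0.0, 1.0]                  # cost c(x)


def mixed_matrix(eta):
    """Transition matrix under the randomized policy (3.3) with bias eta."""
    return [[eta * h + (1.0 - eta) * l for h, l in zip(row_h, row_l)]
            for row_h, row_l in zip(P_HIGH, P_LOW)]


def average_cost(eta):
    """Long-run average cost of the two-state chain via its stationary distribution."""
    P = mixed_matrix(eta)
    p01, p10 = P[0][1], P[1][0]
    pi1 = p01 / (p01 + p10)        # stationary probability of state 1
    return (1.0 - pi1) * COST[0] + pi1 * COST[1]


def solve_bias(target, tol=1e-12):
    """Bisection for (3.4), assuming eta -> average_cost(eta) is increasing."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if average_cost(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In this example the average cost works out to $(1 + 4\eta)/6$, so a target of $1/2$ is met at bias $1/2$; `solve_bias(0.5)` recovers that value numerically.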

Solving the (highly) nonlinear equation (3.4) for the bias value $\eta^*$ is usually a non-trivial
task, even in the simplest of situations [8]. The implementation $\alpha$ of the policy $g$ which is
defined below circumvents this difficulty by bypassing a direct solution of the equation (3.4).
The proposed implementation $\alpha = \{\alpha_n\}_0^\infty$ has the form

$$\alpha_n(\cdot) := \eta_n\, \overline{g}(\cdot, X_n) + (1 - \eta_n)\, \underline{g}(\cdot, X_n), \qquad n = 0,1,\ldots \tag{3.5}$$

where $\{\eta_n\}_0^\infty$ is some sequence of $[0,1]$-valued RV's which play the role of "estimates" for the
bias value $\eta^*$.

In many applications, the mapping $\eta \to J_c(f^\eta)$ is monotone, say monotone increasing
for sake of definiteness. The search for $\eta^*$ can then be interpreted as finding the zero of the
monotone function $\eta \to J_c(f^\eta) - V$ and this brings to mind ideas from the theory of Stochastic
Approximations. Here, this circle of ideas suggests generating a sequence of bias values $\{\eta_n\}_0^\infty$
through the recursion

$$\eta_{n+1} = \big[ \eta_n + a_{n+1} (V - c(X_{n+1})) \big]_0^1, \qquad n = 0,1,\ldots \tag{3.6}$$

with $\eta_0$ given in $[0,1]$. In (3.6), the notation $[x]_0^1 = 0 \vee (x \wedge 1)$ is used for every $x$ in $\mathbb{R}$, and
the sequence of step sizes $\{a_n\}_0^\infty$ satisfies the conditions

$$0 < a_n \downarrow 0, \qquad \sum_{n=0}^\infty a_n = \infty, \qquad \sum_{n=0}^\infty a_n^2 < \infty. \tag{3.7}$$
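
The adaptive scheme (3.5)-(3.6) can be sketched in a few lines of simulation. The sketch below is only an illustration under stated assumptions: the two-state chain, the matrices `P_LOW` and `P_HIGH`, the cost vector, the target value and the step sizes $a_n = 1/(n+1)$ are all hypothetical choices, not taken from the paper. Since the transition probabilities are linear in the policy, randomizing between the two policies with bias $\eta_n$ amounts to mixing the two transition rows.

```python
import random

# Hypothetical two-state example (not from the paper): states 0 and 1, cost c(x) = x.
P_LOW = [[0.9, 0.1], [0.5, 0.5]]   # transition matrix under the undershooting policy
P_HIGH = [[0.5, 0.5], [0.1, 0.9]]  # transition matrix under the overshooting policy
COST = [0.0, 1.0]
V = 0.5                            # target cost; for this chain the matching bias is 0.5


def clip01(x):
    """Projection [x]_0^1 = max(0, min(x, 1)) used in (3.6)."""
    return max(0.0, min(1.0, x))


def run_sa(n_steps, eta0=0.0, seed=0):
    """Simulate the adaptive policy (3.5) driven by the recursion (3.6).

    Returns the final bias estimate and the empirical average cost.
    """
    rng = random.Random(seed)
    x, eta = 0, eta0
    total_cost = 0.0
    for n in range(n_steps):
        total_cost += COST[x]
        # (3.5): randomize between the two policies with the current bias eta_n.
        p_to_1 = eta * P_HIGH[x][1] + (1.0 - eta) * P_LOW[x][1]
        x = 1 if rng.random() < p_to_1 else 0
        # (3.6) with the assumed step size a_{n+1} = 1/(n+2), which satisfies (3.7).
        eta = clip01(eta + (V - COST[x]) / (n + 2))
    return eta, total_cost / n_steps
```

Calling `run_sa(50000)` returns an estimate and an empirical average cost that should both sit close to 0.5 for this toy chain; the algorithm never evaluates the average cost function itself, only the observed costs.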


4.  THE  RESULTS 

The purpose of this note is to provide mild conditions under which (i) the estimates
$\{\eta_n\}_0^\infty$ of $\eta^*$ generated through (3.6) are strongly consistent under $P^\alpha$ and (ii) the policies $g$
and $\alpha$ achieve the same cost. These results, which are discussed in the remainder of the paper,
hold for more general situations with (3.3) replaced by a one-parameter family of stationary
policies $\{f^\eta, 0 \le \eta \le 1\}$ such that $f^1 = \overline{g}$ and $f^0 = \underline{g}$ satisfy (3.2). Under a monotonicity
assumption, the same reasoning leads to the sequence of bias values generated by (3.6) and
to an implementation $\alpha$ of $g$ also given by (3.5). This more general formulation is assumed
hereafter and the assumptions of interest can now be stated as conditions (C1)-(C3), where

(C1) Under each policy $f^\eta$, the RV's $\{X_n\}_0^\infty$ form an aperiodic Markov chain with a single
recurrent class;
(C2) The transition probabilities $\eta \to p_{xy}(f^\eta)$ are analytic on $[0,1]$ for all $x$ and $y$ in $S$;
(C3) The equation

$$J_c(f^\eta) = V, \qquad 0 \le \eta \le 1 \tag{4.1}$$

has a unique solution $\eta^*$, and for some $\epsilon > 0$,

$$[J_c(f^\eta) - V](\eta - \eta^*) > 0 \tag{4.2}$$

whenever $\eta \ne \eta^*$ and $|\eta - \eta^*| < \epsilon$ in $[0,1]$.

Condition (C2) is relaxed somewhat in Section 7. It is clearly satisfied when $f^\eta$ is given
by (3.3) for then $p_{xy}(f^\eta) = \eta\, p_{xy}(\overline{g}) + (1 - \eta)\, p_{xy}(\underline{g})$ for all $x$ and $y$ in $S$. The condition (4.2)
is tantamount to local monotonicity and in practice, is often verified by establishing some
stronger monotonicity property on $\eta \to J_c(f^\eta)$ such as (C3bis) below.

(C3bis) The mapping $[0,1] \to \mathbb{R} : \eta \to J_c(f^\eta)$ is strictly monotone, say monotone increasing
for sake of definiteness.

If this mapping is monotone decreasing, or if the inequality in (4.2) is reversed, then
the stochastic approximation algorithm (3.6) is modified by replacing $(V - c(X_{n+1}))$ with
$(c(X_{n+1}) - V)$.

That  (4.1)  has  at  least  one  solution  follows  from  (3.2)  and  from  the  following  result  which 
is  contained  in  the  proof  of  Theorem  5.4. 

Lemma 4.1 Under the assumptions (C1)-(C2), the mapping $[0,1] \to \mathbb{R} : \eta \to J_c(f^\eta)$ is
analytic on $[0,1]$.

The  main  result  of  this  paper  can  now  be  stated. 

Theorem 4.2 Assume (3.2) and (3.7) to hold. Under the assumptions (C1)-(C3), the following
statements hold true.

(i): The sequence of estimates $\{\eta_n\}_0^\infty$ is strongly consistent under $P^\alpha$, i.e.,

$$\lim_n \eta_n = \eta^* \qquad P^\alpha\text{-a.s.} \tag{4.3}$$

(ii): The policies $g$ and $\alpha$ achieve the same cost, i.e.,

$$J_c(\alpha) = J_c(g) = V. \tag{4.4}$$
The approach adopted here for establishing the convergence (4.3) uses an ODE argument
based on the deterministic lemma of Kushner and Clark [3] as presented by Metivier and
Priouret in [7]. The key result for the analysis is probabilistic in nature and is given in the next
theorem, whose proof is delayed till Section 6. To state the result, consider the RV's $\{Y_n\}_0^\infty$
given by

$$Y_n := J_c(f^{\eta_n}) - c(X_{n+1}), \qquad n = 0,1,\ldots \tag{4.5}$$

and for every $T > 0$, pose

$$m(n,T) := \max\Big\{ k > n : \sum_{i=n}^{k-1} a_i \le T \Big\}, \qquad n = 0,1,\ldots \tag{4.6}$$

Theorem 4.3 Under the assumptions (C1)-(C2), the convergence

$$\lim_n \Big( \sup_{n \le k \le m(n,T)} \Big| \sum_{i=n}^{k} a_i Y_i \Big| \Big) = 0 \qquad P^\alpha\text{-a.s.} \tag{4.7}$$

takes place.
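
The index $m(n,T)$ in (4.6) marks the end of a window of "algorithmic time" of length about $T$ starting at step $n$. A minimal sketch of its computation follows; the step sizes $a_i = 1/(i+1)$ and the function names are assumptions chosen for illustration, not taken from the paper.

```python
def step(i):
    """Hypothetical step sizes a_i = 1/(i+1); they satisfy (3.7)."""
    return 1.0 / (i + 1)


def window_end(n, T, a=step):
    """Return m(n, T) = max{k > n : a_n + ... + a_{k-1} <= T} as in (4.6).

    Returns n itself when a_n > T, i.e. when the set in (4.6) is empty.
    """
    k, partial = n, 0.0            # invariant: partial = a_n + ... + a_{k-1}
    while partial + a(k) <= T:
        partial += a(k)
        k += 1
    return k
```

Because the step sizes decrease to zero, the number of indices between $n$ and $m(n,T)$ grows with $n$ while the accumulated step size over the window stays within $T$; the loop terminates because the step sizes are not summable.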

Proof of Theorem 4.2. The result (4.4) on the cost follows readily from the parameter
convergence (4.3) upon making use of Theorem 3.1 of [10] which provides extensions to an
argument originally due to Mandl [6, Thm. 3, p. 46].

As explained by Metivier and Priouret [7], the convergence (4.7) underlies the $P^\alpha$-a.s.
convergence of $\{\eta_n\}_0^\infty$ to $\eta^*$. The reader is invited to consult [3,7] for a complete exposition of
the arguments which are now briefly summarized: Interpolate the estimate sequence $\{\eta_n\}_0^\infty$,
say by piecewise linear functions $\eta(\cdot) : [0,\infty) \to \mathbb{R}$ anchored at $\eta_n$ at $t_n = \sum_{i=0}^{n-1} a_i$, and define a
sequence of left shifts $\eta^{(n)}(t) = \eta(t + t_n)$ which bring the "asymptotic part" of $\{\eta_n\}_0^\infty$ back
to a neighborhood of the time origin.

Now observe that the recursion (3.6) can be written in the form

$$\eta_{n+1} = \big[ \eta_n + a_{n+1} [ (V - J_c(f^{\eta_n})) + Y_n ] \big]_0^1, \qquad n = 0,1,\ldots \tag{4.8}$$

and that from any subsequence $\{\eta^{(n_m)}(\cdot)\}_0^\infty$ a further convergent subsequence
can then be extracted by standard boundedness and equicontinuity arguments.
It is then easy to see from Theorem 4.3 that its limit $\eta(\cdot)$, and for that matter the limit of any
convergent subsequence, satisfies the ODE

$$\dot{\eta}(t) = V - J_c(f^{\eta(t)}), \qquad t \ge 0, \quad \eta(0) \text{ in } [0,1], \tag{4.9}$$

which is asymptotically stable with a unique stable point $\eta^*$, as a consequence of (C3).

A simple shifting argument now implies $\eta(t) = \eta^*$ for all $t \ge 0$ and this completes the
proof. These arguments are now standard and are omitted here for sake of brevity. □
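
The stability of (4.9) can be checked numerically on a simple instance. The sketch below integrates the ODE by forward Euler for a hypothetical affine cost $J_c(f^\eta) = (1 + 4\eta)/6$ with target $V = 1/2$; this cost function, the target and the integration parameters are assumptions made for illustration and are not part of the paper.

```python
def drift(eta, target=0.5):
    """Right-hand side V - J_c of (4.9) for the hypothetical cost (1 + 4*eta)/6."""
    return target - (1.0 + 4.0 * eta) / 6.0


def integrate(eta0, horizon=20.0, dt=0.01):
    """Forward-Euler integration of (4.9) started at eta0."""
    eta = eta0
    for _ in range(int(round(horizon / dt))):
        eta += dt * drift(eta)
    return eta
```

Here the cost is increasing in the bias, so (4.2) holds with $\eta^* = 1/2$, and trajectories started at either endpoint of $[0,1]$ approach $1/2$, as the stability claim predicts.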

5.  SOME  REGULARITY  RESULTS 

The proof of the convergence (4.3) is based on the so-called ODE method as presented by
Metivier and Priouret [7]. This approach hinges crucially on the fact that several quantities
of interest are Lipschitz continuous (in the variable $\eta$) and it is the purpose of this section
to establish the requisite regularity properties in some detail. In what follows, it will be
convenient to view any mapping $f : S \to \mathbb{R}$ as a $d$-dimensional vector $(f(x))$ (still denoted by
$f$). Also, let $I_d$ denote the $d \times d$ identity matrix and let $0_d$ stand for the $1 \times d$ row vector with
zero entries.

Note first that in the special case of (3.3), condition (C1) follows from a simple condition
on $\overline{g}$ and $\underline{g}$.

Lemma 5.1 Let $f^\eta$ be given by (3.3). If both Markov chains $P(\overline{g})$ and $P(\underline{g})$ are irreducible
(resp. aperiodic), so is each one of the Markov chains $P(f^\eta)$, $0 \le \eta \le 1$.

Proof. Note that if for some $n = 0,1,\ldots$ and some pair of states $x$ and $y$, either $p^{(n)}_{xy}(\overline{g}) > 0$
or $p^{(n)}_{xy}(\underline{g}) > 0$, then $p^{(n)}_{xy}(f^\eta) > 0$ for all $0 < \eta < 1$. The result now follows readily from the
definitions of irreducibility and aperiodicity. □

Under (C1), the Markov chain $P(f^\eta)$ is positive recurrent for all $0 \le \eta \le 1$ (since $S$ is
finite) and therefore possesses a unique invariant measure $\pi(\eta)$ which is interpreted as a $1 \times d$
row vector $(\pi(\eta,x))$. It is well known that this invariant vector $\pi(\eta)$ is the unique solution to
the system of equations

$$\pi = \pi P(f^\eta), \qquad \pi e_d = 1 \tag{5.1}$$

in the variable $\pi = (\pi(x))$ in $\mathbb{R}^{1 \times d}$ with $e_d$ denoting the $d \times 1$ column vector with all entries
equal to unity.
The next Lemma is useful for establishing the required regularity results. Throughout
the discussion, the analyticity of a matrix-valued mapping is understood entrywise.

Lemma 5.2 If the mapping $[0,1] \to \mathbb{R}^{d \times d} : \eta \to A(\eta) = (A_{xy}(\eta))$ is analytic with
the property that the inverse $A^{-1}(\eta)$ of $A(\eta)$ exists for every $\eta$ in $[0,1]$, then the mapping
$[0,1] \to \mathbb{R}^{d \times d} : \eta \to A^{-1}(\eta)$ is analytic on $[0,1]$.


Proof. By standard results from Linear Algebra, there exist $d^2 + 1$ polynomial functions
$r_0 : \mathbb{R}^{d^2} \to \mathbb{R}$ and $r_{xy} : \mathbb{R}^{d^2} \to \mathbb{R}$, with $x$ and $y$ ranging in $S$, in $d^2$ variables $A = (A_{xy})$ such
that

$$A^{-1}(\eta)_{xy} = \frac{r_{xy}(A(\eta))}{r_0(A(\eta))} \tag{5.2}$$

for all $x$ and $y$ in $S$ and all $0 \le \eta \le 1$. Here, these polynomial functions are of degree at most
$d$ and the relation $r_0(A(\eta)) = \det A(\eta) \ne 0$ holds for all $0 \le \eta \le 1$.

It now follows from the expression (5.2) that the mapping $\eta \to A^{-1}(\eta)_{xy}$ is a rational function
of the entries of $A(\eta)$ for all $x$ and $y$ in $S$, thus analytic throughout $[0,1]$ except possibly at a
finite number of points where the function may exhibit poles. However, $r_0(A(\eta))$ is analytic in
$\eta$ and has no zero, so that the assumed analyticity of the mapping $\eta \to A(\eta)$ precludes the
existence of poles for each one of the mappings $\eta \to A^{-1}(\eta)_{xy}$ for all $x$ and $y$ in $S$. □

The smoothness of the components of $\pi(\eta)$ can now be investigated.

Lemma 5.3 Under (C1)-(C2), the mapping $[0,1] \to \mathbb{R} : \eta \to \pi(\eta,x)$ is analytic for every $x$
in $S$.

Proof. The equations (5.1) satisfied by the invariant vector can be rewritten more compactly
as

$$\pi Q(\eta) = [0_d \;\; 1] \tag{5.3}$$

where $Q(\eta)$ is the $d \times (d+1)$ matrix given by

$$Q(\eta) := [I_d - P(f^\eta) \;\; e_d]. \tag{5.4}$$

Consider the $d \times d$ matrix $\tilde{Q}(\eta)$ obtained from $Q(\eta)$ by removing its first column. Since
the invariant measure is uniquely determined by (5.1), it is plain that $\pi(\eta)$ is the unique
solution to the vector equation $\pi \tilde{Q}(\eta) = [0_{d-1} \;\; 1]$ with an obvious interpretation for $0_{d-1}$.
Consequently $\tilde{Q}(\eta)$ is invertible and

$$\pi(\eta) = [0_{d-1} \;\; 1]\, \tilde{Q}(\eta)^{-1}. \tag{5.5}$$

The mapping $\eta \to \tilde{Q}(\eta)$ is clearly analytic on $[0,1]$ due to (C2) and the result readily follows
from Lemma 5.2. □
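
The construction (5.3)-(5.5) translates directly into a small linear solve. The following sketch uses exact rational arithmetic from the Python standard library; the helper names `solve` and `invariant_vector` and the Gauss-Jordan routine are illustrative choices, not part of the paper, and the code assumes the chain has a single recurrent class so that the reduced matrix is invertible.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a nonzero pivot (raises StopIteration if A is singular).
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def invariant_vector(P):
    """Invariant row vector of the stochastic matrix P via (5.3)-(5.5)."""
    d = len(P)
    # Q = [I_d - P, e_d] is d x (d+1); dropping its first column gives the d x d matrix Q~.
    Q_tilde = [[(1 if x == y else 0) - P[x][y] for y in range(1, d)] + [1]
               for x in range(d)]
    # pi Q~ = [0_{d-1} 1] is the transposed system Q~^T pi^T = [0_{d-1} 1]^T.
    Q_tilde_T = [[Q_tilde[x][j] for x in range(d)] for j in range(d)]
    return solve(Q_tilde_T, [0] * (d - 1) + [1])
```

For instance, the symmetric two-state matrix with off-diagonal entries 3/10 yields the uniform vector (1/2, 1/2). Since every entry of the result is a ratio of polynomials in the entries of the transition matrix, the smoothness claimed in Lemma 5.3 is visible in the computation itself.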

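It is worth pointing out that under (C1),

$$\lim_n \frac{1}{n+1} E^{f^\eta} \Big[ \sum_{i=0}^n 1(X_i = x) \Big] = \pi(\eta, x) \tag{5.6}$$

for all $x$ in $S$ (independently of the initial distribution), whence

$$J_c(f^\eta) = \lim_n \frac{1}{n+1} E^{f^\eta} \Big[ \sum_{i=0}^n c(X_i) \Big] = \sum_x \pi(\eta,x) c(x). \tag{5.7}$$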

Of interest here are the Poisson equations associated with the cost $c$ under the policies
$f^\eta$, $0 \le \eta \le 1$. More precisely, the mapping $h : S \to \mathbb{R}$ and the constant $J$ (in $\mathbb{R}$) solve the
Poisson equation (associated with $c$) under policy $f^\eta$ if

$$h(x) + J = c(x) + \sum_y p_{xy}(f^\eta) h(y) \tag{5.8a}$$

for all $x$ in $S$, or in equivalent matrix form,

$$h + J e_d = c + P(f^\eta) h. \tag{5.8b}$$

It is clear that if the pair $(J,h)$ solves (5.8) so does $(J, h + a e_d)$ for every $a$ in $\mathbb{R}$. Moreover,
it is well known that if the pairs $(J_1,h_1)$ and $(J_2,h_2)$ both solve (5.8), then

$$J_1 = J_2 = \lim_n \frac{1}{n+1} E^{f^\eta} \Big[ \sum_{i=0}^n c(X_i) \Big] \tag{5.9}$$

and $h_1 - h_2$ is constant on recurrent classes.

As pointed out earlier, the Markov chain $P(f^\eta)$ has a single positive recurrent class under
(C1) (for each $0 \le \eta \le 1$), in which case the Poisson equation (5.8) has exactly one solution
$(J_c(\eta), h(\eta))$ where $h(\eta) : S \to \mathbb{R}$ is determined up to an additive constant [11, Thm. 4.1]. A
particular representative, still denoted $h(\eta)$, is now described. Before giving this definition, it
is convenient to observe that

$$J_c(\eta) = \lim_n \frac{1}{n+1} E^{f^\eta} \Big[ \sum_{i=0}^n c(X_i) \Big] = J_c(f^\eta) \tag{5.10}$$

as a result of (5.9).

Define the stochastic matrix $P^*(f^\eta)$ by

$$P^*(f^\eta) := \lim_n \frac{1}{n+1} \sum_{i=0}^n P^i(f^\eta). \tag{5.11}$$

This limit exists under (C1) by virtue of elementary results in the theory of Markov chains
[2]. Since $P(f^\eta)$ has a single recurrent class, it is plain from (5.6) that all the rows of $P^*(f^\eta)$
are identical to $\pi(\eta)$, so that

$$P^*(f^\eta) = e_d \pi(\eta) \tag{5.12}$$

for all $0 \le \eta \le 1$.

It is now a simple exercise to see that the eigenvectors of $P^*$ coincide with those of $P$, and
that the matrix $G(\eta) := P(f^\eta) - P^*(f^\eta)$ has spectral radius strictly less than unity, whence
$I_d - G(\eta)$ is invertible. For all $0 \le \eta \le 1$, the mapping $h(\eta) : S \to \mathbb{R}$ is now defined by

$$h(\eta) := [I_d - G(\eta)]^{-1} [I_d - P^*(f^\eta)] c. \tag{5.13}$$

Simple algebraic manipulations show that the pair $(J_c(\eta), h(\eta))$ given by (5.10) and (5.13)
solves the Poisson equation (5.8), since $J_c(\eta) e_d = e_d \pi(\eta) c = P^*(f^\eta) c$ by virtue of (5.7) and
(5.12).
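
The representative (5.13) and the claim that it solves (5.8) can be checked on a small example. The sketch below is an illustration only: the helper names, the Gauss-Jordan routine and the two-state test chain mentioned afterwards are assumptions, not from the paper. It uses the identities $P^* P = P^* P^* = P^*$, which give $P^* h(\eta) = 0$ and hence $(I_d - P) h(\eta) = c - J_c(\eta) e_d$.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def poisson_solution(P, c):
    """Return (J, h) with J = pi c as in (5.7) and h given by (5.13)."""
    d = len(P)
    # Invariant vector: balance equations for states 1..d-1 plus normalization
    # (the balance equation for state 0 is redundant, cf. (5.5)).
    A = [[(1 if x == y else 0) - P[x][y] for x in range(d)] for y in range(1, d)]
    A.append([1] * d)
    pi = solve(A, [0] * (d - 1) + [1])
    J = sum(pi[x] * c[x] for x in range(d))
    # I - G = I - P + P*, where every row of P* equals pi by (5.12).
    I_minus_G = [[(1 if x == y else 0) - P[x][y] + pi[y] for y in range(d)]
                 for x in range(d)]
    # [I - P*] c = c - J e, because P* c = e (pi c).
    rhs = [c[x] - J for x in range(d)]
    return J, solve(I_minus_G, rhs)
```

For the symmetric two-state chain with off-diagonal entries 3/10 and cost vector (0, 1), this returns $J = 1/2$ and $h = (-5/6, 5/6)$, and substituting into (5.8a) confirms the Poisson equation state by state.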

Theorem 5.4 Under the assumptions (C1)-(C2), the solution pair to the Poisson equation (5.8)
given by (5.10) and (5.13) is analytic on $[0,1]$, i.e., the mappings $[0,1] \to \mathbb{R} : \eta \to J_c(f^\eta)$ and
$[0,1] \to \mathbb{R} : \eta \to h(\eta,x)$, with $x$ ranging over $S$, are all analytic.

Proof. Since $S$ is finite, the analyticity of the mapping $\eta \to J_c(\eta)$ is an immediate consequence
of Lemma 5.3 in view of (5.7) and (5.10).
The matrix-valued function $\eta \to P^*(f^\eta)$ is analytic on $[0,1]$ as a result of the representation
(5.12) and of Lemma 5.3. It is now plain that the mappings $\eta \to I_d - P^*(f^\eta)$ and
$\eta \to I_d - G(\eta)$ are both analytic on $[0,1]$, and the result now follows from Lemma 5.2. □

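As a consequence of Theorem 5.4, since $S$ is finite, there exists a positive constant $K$ such
that

$$|J_c(\eta) - J_c(\tilde{\eta})| \le K|\eta - \tilde{\eta}| \quad \text{and} \quad \sup_x |h(\eta,x) - h(\tilde{\eta},x)| \le K|\eta - \tilde{\eta}| \tag{5.14}$$

for all $0 \le \eta, \tilde{\eta} \le 1$.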

6.  A  PROOF  OF  THEOREM  4.3 

This section is devoted to the proof of the a.s. convergence result (4.7). It is plain from
Theorem 5.4 that for each $x$ in $S$, the mapping $\eta \to h(\eta,x)$ is continuous on $[0,1]$, thus
bounded, and therefore

$$B := \sup_\eta \sup_x |h(\eta,x)| < \infty \tag{6.1}$$

since $S$ is finite. Moreover, with the simplified notation $E^\eta$ for the expectation operator $E^{f^\eta}$,
the Poisson equation (5.8) easily implies that

$$E^\eta[h(\eta, X_{n+1}) \mid \mathcal{F}_n] = h(\eta, X_n) + J_c(\eta) - c(X_n), \qquad n = 0,1,\ldots \tag{6.2}$$

for all $0 \le \eta \le 1$, whence

$$\big| E^\eta[h(\eta, X_{n+1}) \mid \mathcal{F}_n] - E^{\tilde{\eta}}[h(\tilde{\eta}, X_{n+1}) \mid \mathcal{F}_n] \big| = \big| h(\eta, X_n) - h(\tilde{\eta}, X_n) + J_c(\eta) - J_c(\tilde{\eta}) \big| \le 2K|\eta - \tilde{\eta}|, \qquad n = 0,1,\ldots \tag{6.3}$$

by making use of (5.14).

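It follows from (5.8) that

$$-Y_n = c(X_{n+1}) - J_c(\eta_n) = h(\eta_n, X_{n+1}) - E^{\eta_n}[h(\eta_n, X_{n+2}) \mid \mathcal{F}_{n+1}] = Z_n^{(1)} + Z_n^{(2)} + Z_n^{(3)}, \qquad n = 0,1,\ldots \tag{6.4}$$

with

$$Z_n^{(1)} := h(\eta_n, X_{n+1}) - E^{\eta_n}[h(\eta_n, X_{n+1}) \mid \mathcal{F}_n], \tag{6.5a}$$

$$Z_n^{(2)} := E^{\eta_n}[h(\eta_n, X_{n+1}) \mid \mathcal{F}_n] - E^{\eta_{n+1}}[h(\eta_{n+1}, X_{n+2}) \mid \mathcal{F}_{n+1}] \tag{6.5b}$$

and

$$Z_n^{(3)} := E^{\eta_{n+1}}[h(\eta_{n+1}, X_{n+2}) \mid \mathcal{F}_{n+1}] - E^{\eta_n}[h(\eta_n, X_{n+2}) \mid \mathcal{F}_{n+1}] \tag{6.5c}$$

for all $n = 0,1,\ldots$ Define the RV's $\{S_n^{(k)}\}_0^\infty$ for all $k = 1,2,3$, by posing

$$S_n^{(k)} := \sum_{i=0}^{n-1} a_i Z_i^{(k)}, \qquad n = 1,2,\ldots \tag{6.6}$$

with $S_0^{(1)} = S_0^{(2)} = S_0^{(3)} = 0$. It now suffices to show that

$$\lim_n \Big( \sup_{n \le \ell \le m(n,T)} \Big| \sum_{i=n}^{\ell} a_i Z_i^{(k)} \Big| \Big) = 0 \qquad P^\alpha\text{-a.s.} \tag{6.7}$$

for all $T > 0$ and all $k = 1,2,3$.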

It is plain that the RV's $\{Z_n^{(1)}\}_0^\infty$ form a $(P^\alpha, \mathcal{F}_n)$ martingale-difference, whence $\{S_n^{(1)}\}_0^\infty$
is a zero mean $(P^\alpha, \mathcal{F}_n)$-martingale. Routine calculations show that

$$\sup_n E^\alpha \big[ |S_n^{(1)}|^2 \big] = \sup_n E^\alpha \Big[ \sum_{i=0}^{n-1} a_i^2 |Z_i^{(1)}|^2 \Big] \le 4B^2 \sum_{i=0}^\infty a_i^2 \tag{6.8}$$

upon using (6.1) and (3.7), and the $(P^\alpha, \mathcal{F}_n)$-martingale $\{S_n^{(1)}\}_0^\infty$ is thus uniformly integrable
under $P^\alpha$. By the Martingale Convergence Theorem, the RV's $\{S_n^{(1)}\}_0^\infty$ converge a.s. under
$P^\alpha$ (to an a.s. finite limit), in which case they form a Cauchy sequence $P^\alpha$-a.s. and (6.7)
follows for $k = 1$.

To prove (6.7) for $k = 2$, note that for all $0 < n \le \ell$, the relation

$$S_{\ell+1}^{(2)} - S_n^{(2)} = \sum_{i=n}^{\ell} a_i Z_i^{(2)} = \sum_{i=n}^{\ell} (a_i - a_{i-1}) E^{\eta_i}[h(\eta_i, X_{i+1}) \mid \mathcal{F}_i] + a_{n-1} E^{\eta_n}[h(\eta_n, X_{n+1}) \mid \mathcal{F}_n] - a_\ell E^{\eta_{\ell+1}}[h(\eta_{\ell+1}, X_{\ell+2}) \mid \mathcal{F}_{\ell+1}] \tag{6.9}$$

holds. It is now plain from (6.1) that

$$|S_{\ell+1}^{(2)} - S_n^{(2)}| \le B \sum_{i=n}^{\ell} (a_{i-1} - a_i) + B(a_{n-1} + a_\ell) \tag{6.10}$$

$$\le 2B a_{n-1} \tag{6.11}$$

upon telescoping the terms in the first sum on the right-hand side of (6.10) and making use
of the monotonicity of the weight sequence $\{a_n\}_0^\infty$. The conclusion (6.7) for $k = 2$ is now
immediate.

Finally for $k = 3$, note from (6.3) that

$$|Z_n^{(3)}| \le 2K|\eta_n - \eta_{n+1}|, \qquad n = 0,1,\ldots \tag{6.12}$$

whereas the recursion (3.6) implies

$$|\eta_{n+1} - \eta_n| \le a_{n+1} |V - c(X_{n+1})| \le a_{n+1} \tilde{B}, \qquad n = 0,1,\ldots \tag{6.13}$$

with $\tilde{B} := |V| + \sup_x |c(x)|$. By combining (6.12) and (6.13), the inequality

$$|Z_n^{(3)}| \le 2\tilde{B}K a_{n+1}, \qquad n = 0,1,\ldots \tag{6.14}$$

is seen to hold and since $\{a_n\}_0^\infty$ is decreasing, this yields the bound

$$\sup_{n \le \ell \le m(n,T)} \Big| \sum_{i=n}^{\ell} a_i Z_i^{(3)} \Big| \le 2\tilde{B}K a_n \sum_{i=n}^{m(n,T)} a_i \le 2\tilde{B}K a_n (T + a_n). \tag{6.15}$$

The convergence (6.7) now follows from (3.7). □

7.  CONCLUDING  REMARKS 

The  results  of  this  paper  can  be  given  the  interpretation  either  of  an  estimation  procedure, 
where  the  estimated  parameter  is  defined  through  (3.4),  or  of  an  adaptive  implementation 
scheme,  where  the  controls  are  generated  “on  line”  through  (3.6).  The  paper  concludes  with 
an  application  to  constrained  MDP’s  and  with  several  extensions  of  the  results. 

Constrained  optimization 
The results of Section 4 have an immediate application to the following problem. Let
$c$ and $d$ be two cost functions $S \to \mathbb{R}$ and denote the corresponding long-run average costs
incurred by an arbitrary policy $\gamma$ in $\Pi$, as defined in (3.1), by $J_c(\gamma)$ and $J_d(\gamma)$, respectively.
With $\Pi_V := \{\gamma \in \Pi : J_c(\gamma) \ge V\}$ for some $V$ in $\mathbb{R}$, consider the constrained optimization
problem

Maximize $J_d(\cdot)$ over $\Pi_V$.

In the event $c \le 0$ and $d \ge 0$, the problem has a natural interpretation of maximizing the
reward subject to a bound on the cost. Assume henceforth that $\Pi_V$ is non-empty and strictly
contained in $\Pi$, so that the problem is feasible but not trivial.

Beutler and Ross [1] have shown that if $U$ is compact and if the mappings $u \to p_{xy}(u)$ are
continuous for all $x$ and $y$ in $S$, then there exist two Markov deterministic policies $\underline{g}$ and $\overline{g}$ so
that (3.2) holds. Moreover, if $f^\eta$ is given by (3.3), then $\eta \to J_c(f^\eta)$ is continuous, and if $\eta^*$
solves (3.4), then $g = f^{\eta^*}$ is a solution to the constrained optimization problem.

Applying the results of Theorem 4.2, it follows that if $\eta \to J_c(f^\eta)$ satisfies condition
(C3), then the policy $\alpha$ obtained through (3.5)-(3.6) satisfies $J_c(\alpha) = J_c(g) = V$. Similarly,
$J_d(\alpha) = J_d(g)$ and $\alpha$ solves the constrained optimization problem.

Extensions 

The  results  of  this  paper  can  be  obtained  under  regularity  conditions  which  are  much 
weaker  than  (C2).  One  possible  set  of  conditions  under  which  the  analysis  carries  through  is 
stated  as  condition  (C2bis)  below,  where 

(C2bis) The transition probabilities $\eta \to p_{xy}(f^\eta)$ are Hölder continuous for all $x$ and $y$ in $S$,
i.e., there exist constants $K > 0$ and $0 < \beta \le 1$, such that

$$|p_{xy}(f^\eta) - p_{xy}(f^{\eta'})| \le K|\eta - \eta'|^\beta \tag{7.1}$$

for all $x$ and $y$ in $S$.

In exact parallel with the developments of Sections 5 and 6, conditions (C1), (C2bis) and
(C3) are sufficient to guarantee that

(i): For all $x$ in $S$, the mapping $\eta \to \pi(\eta,x)$ is Hölder continuous with parameter $\beta$.

(ii): The mappings $\eta \to J_c(f^\eta)$ and $\eta \to h(\eta,x)$, with $x$ ranging over $S$, are all Hölder
continuous with parameter $\beta$.

(iii): If $\{\eta_n\}_0^\infty$ is given by (3.6), then (4.3) and (4.4) hold.

The proofs of (i)-(ii) are identical to the ones given for Lemma 5.3 and Theorem 5.4,
respectively, upon observing that the class of Hölder continuous functions with parameter $\beta$ is
closed under addition and multiplication, and under composition with the function $x \to 1/x$ on
closed intervals which do not include 0. The proof of Theorem 4.3 carries over with a slight
modification, namely that the last term in (6.3) and (6.12) needs to be changed to $2K|\eta - \tilde{\eta}|^\beta$.
Modifying (6.14)-(6.15) appropriately, the last bound in (6.15) becomes $2\tilde{B}^\beta K a_n^\beta (T + a_n)$,
which converges to zero due to (3.7).

If  the  regularity  postulated  in  (C2bis)  is  changed  to  continuous  differentiability  of  order 
r,  then  the  same  remarks  show  that  the  smoothness  in  (i)-(ii)  will  then  also  be  of  order  r. 